48 research outputs found

    Ordonnancement de threads OpenMP et placement de données coordonnés sur architectures hiérarchiques

    Exploiting the potential of hierarchical multiprocessor machines requires a careful distribution of threads and data over the underlying non-uniform architecture in order to avoid memory-access penalties. Directive-based languages such as OpenMP give programmers a simple way to structure the parallelism of their applications and to pass this information on to the runtime system. Our runtime system, based on a multi-level thread scheduler combined with a memory manager designed specifically for NUMA architectures, turns this information into hints that guide the scheduler to respect affinities between threads and data. It provides dynamic distribution of the workload, guided by the structure of the application and the topology of the target machine, with the aim of achieving performance portability. Early experiments show that a mixed approach, combining thread movement and data migration, behaves better than next-touch-based data distribution policies, suggesting further optimization opportunities.

    Exécution structurée d'applications OpenMP à grain fin sur architectures multicoeurs

    Contemporary multiprocessor architectures, which naturally reflect the current evolution of microprocessors toward massively multicore chips, exhibit increasingly hierarchical parallelism. To approach the theoretical performance of these machines, it is now necessary to extract ever finer-grained parallelism from applications and, above all, to communicate its structure, and if possible scheduling directives, to the underlying runtime system. In this article, we explain why OpenMP is an excellent vehicle for extracting massive, structured, annotated parallelism from applications, and we show how, by means of an extension of the GNU OpenMP compiler built on a NUMA-aware thread scheduler, dynamic and irregular applications can be executed efficiently while preserving thread and data affinity.

    An Efficient OpenMP Runtime System for Hierarchical Architectures

    Exploiting the full computational power of ever-deeper hierarchical multiprocessor machines requires a very careful distribution of threads and data across the underlying non-uniform architecture. The emergence of multicore chips and NUMA machines makes it important to minimize the number of remote memory accesses, to favor cache affinities, and to guarantee fast completion of synchronization steps. By using the BubbleSched platform as a threading backend for the GOMP OpenMP compiler, we are able to easily translate the affinities of thread teams into scheduling hints, using abstractions called bubbles. We then propose a scheduling strategy suited to nested OpenMP parallelism. Preliminary performance evaluations show a significant speedup improvement on a typical NAS OpenMP benchmark application.

    De l'exécution structurée de programmes OpenMP sur architectures hiérarchiques

    This document illustrates the integration of Marcel's bubble structures into the GNU OpenMP compiler to express affinity relations between OpenMP teammates, and introduces Affinity, an OpenMP-dedicated bubble scheduler designed to distribute these bubbles over hierarchical architectures while favouring affinity relations.

    Design methodology for workload-aware loop scheduling strategies based on genetic algorithm and simulation

    In high-performance computing, an application's workload must be evenly balanced among threads to deliver cutting-edge performance and scalability. In OpenMP, the load-balancing problem arises when scheduling loop iterations to threads. Several scheduling strategies have been proposed in this context, but they do not take into account the input workload of the application and thus turn out to be suboptimal. In this work, we introduce a design methodology to propose, study, and assess the performance of workload-aware loop scheduling strategies. In this methodology, a genetic algorithm is employed to explore the solution space of the problem and to guide the design of new loop scheduling strategies, and a simulator is used to evaluate their performance. As a proof of concept, we show how the proposed methodology was used to propose and study a new workload-aware loop scheduling strategy named smart round-robin (SRR). We implemented this strategy in the GNU Compiler Collection's OpenMP runtime and carried out several experiments to validate the simulator and to evaluate the performance of SRR. Our experimental results show that SRR may deliver up to 37.89% and 14.10% better performance than OpenMP's dynamic loop scheduling strategy in the simulated environment and in a real-world application kernel, respectively.

    Description, Implementation and Evaluation of an Affinity Clause for Task Directives

    OpenMP 4.0 introduced dependent tasks, which give the programmer a way to express fine-grained parallelism. Using appropriate OS support (such as NUMA libraries), the runtime can rely on the information in the depend clause to dynamically map tasks onto the architecture topology. Controlling data locality is one of the key factors in reaching a high level of performance when targeting NUMA architectures. On this topic, OpenMP does not yet give the programmer much flexibility, leaving the runtime to decide where a task should execute. In this paper, we present a class of applications that would benefit from such control and flexibility over task and data placement. We also propose our own interpretation of the new affinity clause for the task directive, which is being discussed by the OpenMP Architecture Review Board. This clause enables the programmer to give the runtime hints about task placement during program execution, which can be used to control the data mapping on the architecture. In our proposal, the programmer can express affinity between a task and the following resources: a thread, a NUMA node, and data. We then present an implementation of this proposal in the Clang 3.8 compiler, together with the corresponding extensions in our OpenMP runtime LIBKOMP. Finally, we present a preliminary evaluation of this work, running two task-based OpenMP kernels on a 192-core NUMA architecture, which shows noticeable improvements in both performance and scalability.

    Evaluation of OpenMP Dependent Tasks with the KASTORS Benchmark Suite

    The recent introduction of task dependencies in the OpenMP specification provides new ways of synchronizing tasks. Application programmers can now describe the data a task will read as input and write as output, letting the runtime system resolve fine-grained dependencies between tasks to decide which task should execute next. Such an approach should scale better than the excessive global synchronization found in most OpenMP 3.0 applications. As promising as it looks, however, any new feature needs proper evaluation to encourage application programmers to embrace it. This paper introduces the KASTORS benchmark suite, designed to evaluate OpenMP task dependencies. We modified state-of-the-art OpenMP 3.0 benchmarks and data-flow parallel linear algebra kernels to make use of task dependencies. Learning from this experience, we propose extensions to the current OpenMP specification to improve the expressiveness of dependencies. We finally evaluate both the GCC/libGOMP and the Clang/libIOMP implementations of OpenMP 4.0 on our KASTORS suite, demonstrating the interest of task dependencies compared to taskwait-based approaches.

    Phase-TA: Periodicity Detection and Characterization for HPC Applications

    The world of High-Performance Computing (HPC) currently stands on the edge of exascale. Supercomputers are growing ever more powerful, requiring power-efficient components and ever smarter tool suites to operate them. One of the key features of those frameworks will be their ability to monitor and predict the behavior of executed applications in order to optimize resource utilization and abide by operating constraints, notably on power consumption. In this context, this article presents Phase-TA, an offline tool which detects and characterizes the inherent periodicities of iterative HPC applications, with no prior knowledge of the latter. To do so, it analyzes the evolution of several performance counters at the scale of the compute node and infers patterns representing the identified periodicities. As a result, Phase-TA offers a non-intrusive means to gain insight into the processor use associated with an application, and paves the way to predicting its behavior. Phase-TA was tested on a panel of three applications and benchmarks from the supercomputing field: HPCG, NEMO, and OpenFOAM. For all of them, periodicities accounting for 78% of their execution time on average were detected and represented by accurate patterns. Furthermore, it was demonstrated that there is no need to analyze the whole profile of an application to precisely characterize its periodic behaviors: an extract of that profile is enough for Phase-TA to infer representative patterns on the fly, opening the way to energy-efficiency optimization through Dynamic Voltage and Frequency Scaling (DVFS).